LiveCodeBench Pro

How Olympiad medalists judge LLMs in competitive programming — a contamination-free benchmark from Codeforces, ICPC, and IOI where the best model scores 0% on hard problems

Published

September 5, 2025

Keywords: LiveCodeBench Pro, competitive programming benchmark, LLM coding evaluation, Olympiad medalists, Codeforces, ICPC, IOI, algorithmic reasoning, contamination-free, code generation, pass@1, frontier model evaluation

Introduction

Frontier LLMs are increasingly tested on coding benchmarks — but popular ones like HumanEval and MBPP have been saturated, and even the original LiveCodeBench is becoming routine for the strongest models. Recent reports claim AI now outperforms elite humans in competitive programming. But is that really true?

LiveCodeBench Pro was built to answer this question rigorously. Created by a team that includes Olympiad medalists from international algorithmic contests, it introduces a continuously updated benchmark of problems from Codeforces, ICPC, and IOI — with expert line-by-line error analysis of every model failure. The results are sobering: without external tools, the best model achieves only 53% on medium-difficulty problems and 0% on hard problems.

“High performance appears largely driven by implementation precision and tool augmentation, not superior reasoning. LiveCodeBench Pro thus highlights the significant gap to human grandmaster levels.” — LiveCodeBench Pro Paper

```mermaid
graph LR
    A["Traditional Code Benchmarks<br/>(HumanEval, MBPP)<br/>Saturated"] --> B["Benchmark<br/>Contamination"]
    B --> C["LiveCodeBench Pro<br/>Codeforces + ICPC + IOI<br/>Continuously updated"]
    C --> D["Meaningful signal<br/>for algorithmic<br/>reasoning"]

    style A fill:#e74c3c,stroke:#333,color:#fff
    style B fill:#f39c12,stroke:#333,color:#fff
    style C fill:#27ae60,stroke:#333,color:#fff
    style D fill:#3498db,stroke:#333,color:#fff
```

What Is LiveCodeBench Pro?

LiveCodeBench Pro is a competitive programming benchmark that evaluates LLMs on problems drawn from three elite contest platforms:

  • Codeforces — the world’s largest competitive programming platform
  • ICPC (International Collegiate Programming Contest) — the premier team programming competition
  • IOI (International Olympiad in Informatics) — the top individual programming competition for high schoolers

Unlike static benchmarks, LiveCodeBench Pro is continuously updated with new problems to reduce the likelihood of data contamination — a critical problem in code evaluation where models may have seen solutions during pretraining.
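
The contamination control can be sketched as a simple date filter: only problems released after a model's training cutoff count toward its score. The types and names below are illustrative, not the benchmark's actual pipeline:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Problem:
    source: str        # "Codeforces", "ICPC", or "IOI"
    released: date     # contest date
    difficulty: str    # "easy", "medium", or "hard"

def contamination_free(problems: list[Problem], training_cutoff: date) -> list[Problem]:
    """Keep only problems published after the model's training cutoff,
    so their solutions cannot have appeared in the pretraining corpus."""
    return [p for p in problems if p.released > training_cutoff]

pool = [
    Problem("Codeforces", date(2024, 11, 2), "hard"),
    Problem("IOI", date(2025, 3, 15), "medium"),
]
eligible = contamination_free(pool, training_cutoff=date(2025, 1, 1))
# Only the post-cutoff IOI problem survives the filter.
```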

Key Characteristics

| Feature | Details |
| --- | --- |
| Problem sources | Codeforces, ICPC, IOI |
| Update frequency | Continuously updated (quarterly time windows) |
| Difficulty levels | Easy, Medium, Hard |
| Evaluation metric | pass@1 (single-attempt correctness) |
| Expert annotation | Olympiad medalists annotate algorithmic categories |
| Error analysis | Line-by-line analysis of failed model submissions |
| Anti-contamination | New problems prevent data leakage |

What Makes It Different from Other Code Benchmarks?

```mermaid
graph TD
    LCBPro["LiveCodeBench Pro"] --> E1["Expert-Annotated<br/>Olympiad medalists label<br/>algorithmic categories"]
    LCBPro --> E2["Fine-Grained Errors<br/>Line-by-line analysis<br/>of failures"]
    LCBPro --> E3["Continuously Updated<br/>New problems from<br/>live contests"]
    LCBPro --> E4["Multi-Source<br/>Codeforces + ICPC + IOI"]

    style LCBPro fill:#e74c3c,color:#fff,stroke:#333
    style E1 fill:#3498db,color:#fff,stroke:#333
    style E2 fill:#27ae60,color:#fff,stroke:#333
    style E3 fill:#f39c12,color:#fff,stroke:#333
    style E4 fill:#8e44ad,color:#fff,stroke:#333
```

Two standout features set LiveCodeBench Pro apart:

  1. Olympiad medalist annotations — Every problem is annotated for algorithmic categories (e.g., dynamic programming, graph theory, greedy) by medalists from international contests, providing fine-grained diagnostics
  2. Line-by-line failure analysis — When a model fails, medalists conduct a detailed analysis of why it failed, revealing patterns like confidently incorrect justifications and struggles with nuanced case analysis
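
As a rough illustration of what such an annotation might capture, here is a hypothetical record for one failed submission (the schema and all field values are invented for this sketch, not the benchmark's actual format):

```python
# Hypothetical shape of a medalist annotation for one failed submission;
# every field name and value here is illustrative, not the benchmark's schema.
annotation = {
    "problem_id": "CF-0000X",            # made-up identifier
    "categories": ["dynamic programming", "graph theory"],
    "verdict": "wrong_answer",
    "failure_lines": [42, 57],           # lines flagged in the line-by-line review
    "failure_mode": "confidently incorrect justification",
}

def failure_modes(annotations):
    """Aggregate failure modes across a set of annotated submissions,
    the kind of pattern-counting the paper's analysis reports."""
    counts = {}
    for a in annotations:
        counts[a["failure_mode"]] = counts.get(a["failure_mode"], 0) + 1
    return counts
```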

Who Built It?

LiveCodeBench Pro was developed by a multi-institutional team of researchers and competitive programming medalists:

  • Zihan Zheng, Zerui Cheng, Zeyu Shen, Shang Zhou, Kaiyuan Liu, Hansen He, Dongruixuan Li, Stanley Wei, Hangyi Hao — Core researchers and competitive programming experts
  • Jianzhu Yao, Peiyao Sheng, Zixuan Wang, Wenhao Chai — Contributing researchers
  • Aleksandra Korolova, Peter Henderson — Academic advisors
  • Sanjeev Arora, Pramod Viswanath, Jingbo Shang, Saining Xie — Senior advisors

The team draws from institutions including Princeton University, UC San Diego, NYU, and other leading AI research groups.

Publication

The paper was published in June 2025 on arXiv, with the project page providing live leaderboard updates.

| Resource | Link |
| --- | --- |
| arXiv paper | arxiv.org/abs/2506.11928 |
| Project page | livecodebenchpro.com |

What Skills Does It Test?

LiveCodeBench Pro tests the full spectrum of algorithmic reasoning and competitive programming skills — the hardest coding capabilities to master.

```mermaid
graph TD
    LCBPro["LiveCodeBench Pro<br/>Competitive Programming"] --> A["Algorithm Design<br/>DP, greedy, divide<br/>& conquer"]
    LCBPro --> B["Graph Theory<br/>Shortest path, MST,<br/>network flow"]
    LCBPro --> C["Mathematical<br/>Reasoning<br/>Number theory,<br/>combinatorics"]
    LCBPro --> D["Data Structures<br/>Segment trees,<br/>balanced BSTs"]
    LCBPro --> E["Complex Case<br/>Analysis<br/>Edge cases,<br/>corner conditions"]
    LCBPro --> F["Implementation<br/>Precision<br/>Efficient, bug-free<br/>code"]

    style LCBPro fill:#e74c3c,color:#fff,stroke:#333
    style A fill:#3498db,color:#fff,stroke:#333
    style B fill:#27ae60,color:#fff,stroke:#333
    style C fill:#f39c12,color:#fff,stroke:#333
    style D fill:#8e44ad,color:#fff,stroke:#333
    style E fill:#e67e22,color:#fff,stroke:#333
    style F fill:#6cc3d5,color:#fff,stroke:#333
```

| Capability | What LiveCodeBench Pro Tests |
| --- | --- |
| Algorithmic reasoning | Designing correct algorithms for novel problems under constraints |
| Implementation precision | Writing bug-free, efficient code that handles all edge cases |
| Complex case analysis | Identifying and handling nuanced corner cases that break naive solutions |
| Mathematical reasoning | Number theory, combinatorics, and proof-based thinking |
| Advanced data structures | Segment trees, Fenwick trees, balanced BSTs, and more |
| Problem decomposition | Breaking complex problems into solvable subproblems |
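
Some of these structures are compact enough to sketch. A minimal Fenwick (binary indexed) tree, one of the data structures named above, supports point updates and prefix sums in O(log n):

```python
class FenwickTree:
    """Fenwick (binary indexed) tree: O(log n) point updates and prefix sums,
    a staple of the data-structure problems found on these contest platforms."""

    def __init__(self, n: int):
        self.n = n
        self.tree = [0] * (n + 1)   # 1-indexed internally

    def update(self, i: int, delta: int) -> None:
        """Add delta at position i (1-indexed)."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & (-i)           # jump to the next responsible node

    def prefix_sum(self, i: int) -> int:
        """Return the sum of positions 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & (-i)           # strip the lowest set bit
        return s

ft = FenwickTree(8)
for pos, val in [(1, 3), (4, 5), (6, 2)]:
    ft.update(pos, val)
# prefix_sum(5) covers positions 1..5, i.e. 3 + 5 = 8
```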

Key Findings from Medalist Analysis

The medalist team’s line-by-line analysis revealed critical patterns:

  • LLMs excel at implementation-heavy problems — tasks requiring clean, straightforward coding
  • LLMs struggle with nuanced algorithmic reasoning — problems requiring creative algorithm design
  • LLMs fail at complex case analysis — they miss subtle edge cases that human experts catch
  • LLMs generate confidently incorrect justifications — they provide plausible-sounding but wrong explanations for their approach
  • High performance is driven by tool augmentation, not superior reasoning — external tools mask reasoning weaknesses

Current Leaderboard

The leaderboard below shows model performance on LiveCodeBench Pro as displayed on the official project page. The default view shows the Hard difficulty level, which best reveals the gap between AI and human competitive programmers.

Source: LiveCodeBench Pro Leaderboard (consulted March 28, 2026). Continuously updated with new problems and models.

Hard Difficulty (pass@1)

| Rank | Model | Accuracy (%) |
| --- | --- | --- |
| 1 | Gemini 3 Deep Think | 81.6 |
| 2 | Gemini 3.1 Pro Preview | 75.5 |
| 3 | GPT-5.2 (high) | 53.1 |
| 4 | Gemini 3 Pro Preview | 49.0 |
| 5 | Gemini 3 Flash Preview | 46.9 |
| 6 | GPT-5 (high) | 44.9 |
| 7 | o4-mini (high) | 32.7 |
| 8 | Qwen3 Next 80B A3B (thinking) | 14.3 |
| 9 | DeepSeek R1 | 8.2 |
| 10 | (other models) | 4.1 |

Key takeaways:

  • Even the best model (Gemini 3 Deep Think) at 81.6% leaves ~20% of hard problems unsolved — and this represents the latest frontier with deep thinking capabilities
  • The original paper finding (June 2025) was even starker: 0% pass@1 on hard problems without external tools — progress since then reflects newer reasoning models
  • Large spread even among reasoning models — o4-mini (high) at 32.7% vs. Gemini 3 Deep Think at 81.6% shows how much the depth of thinking capability matters
  • Open-source models lag significantly — Qwen3 Next and DeepSeek R1 score under 15% on hard problems
  • The benchmark is continuously updated with new problems, so these scores reflect real generalization, not memorization

For the full, up-to-date leaderboard across all difficulty levels (Easy, Medium, Hard) and time windows, visit the project page linked in the next section.

Where to Explore the Benchmark

Leaderboard and Project

| Resource | Description | Link |
| --- | --- | --- |
| Project page | Official website with live leaderboard, difficulty filters, and time windows | livecodebenchpro.com |
| arXiv paper | Full technical paper with methodology, medalist analysis, and findings | arxiv.org/abs/2506.11928 |
| Evaluation toolkit | Local evaluation guide — plug in your own model interface | LiveCodeBench Pro Toolkit Guide |

Understanding the Metric

Pass@1

The primary metric is pass@1: the model generates a single solution, which must pass all test cases on the first attempt. This is the most stringent standard — no retries, no majority voting, no external tool augmentation.
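
A minimal sketch of how pass@1 reduces to a success fraction over a problem set (the function and data here are illustrative, not the benchmark's actual harness):

```python
def pass_at_1(results: dict[str, bool]) -> float:
    """results maps problem_id -> whether the model's single attempt
    passed all test cases. pass@1 is simply the success fraction."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)

# Hypothetical outcomes on a four-problem hard set:
hard_set = {"P1": False, "P2": True, "P3": False, "P4": False}
score = pass_at_1(hard_set)   # 1/4 = 0.25, reported as 25.0%
```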

| Difficulty | What It Means |
| --- | --- |
| Easy | Straightforward implementation problems — models generally perform well |
| Medium | Require correct algorithm selection and solid implementation — frontier models reach ~50% |
| Hard | Demand creative algorithmic insight and complex case analysis — the true differentiator |

Why “Hard” Matters Most

Hard problems in competitive programming are specifically designed to require:

  • Novel algorithmic insight — not just applying a known algorithm, but combining or inventing approaches
  • Tight constraint handling — solutions must be both correct AND efficient within time/memory limits
  • Exhaustive case analysis — missing a single edge case means a wrong answer

This is precisely where the medalist analysis found LLMs failing most — generating plausible but incorrect reasoning and missing the subtle observations that distinguish expert competitive programmers.
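
The "correct AND efficient" requirement is easy to illustrate with a generic pair-counting task (a textbook example, not a benchmark problem): both functions below give the same answer, but only the second survives typical contest limits.

```python
from collections import Counter

def count_pairs_slow(a: list[int], target: int) -> int:
    """Correct but O(n^2): times out once n reaches contest-scale inputs."""
    return sum(1 for i in range(len(a))
                 for j in range(i + 1, len(a))
                 if a[i] + a[j] == target)

def count_pairs_fast(a: list[int], target: int) -> int:
    """Same answer in O(n) using a running frequency table."""
    seen: Counter[int] = Counter()
    pairs = 0
    for x in a:
        pairs += seen[target - x]   # every earlier complement forms a pair
        seen[x] += 1
    return pairs

data = [1, 5, 7, -1, 5]
# Pairs summing to 6: (1,5), (7,-1), (1,5) -> 3
```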

Why LiveCodeBench Pro Matters

```mermaid
graph LR
    A["Claims: AI surpasses<br/>elite humans<br/>in coding"] --> C["LiveCodeBench Pro<br/>tests this claim<br/>rigorously"]
    B["Contamination risk<br/>in static<br/>benchmarks"] --> C
    C --> D["Reveals true<br/>algorithmic<br/>reasoning gaps"]
    C --> E["Expert-annotated<br/>failure analysis"]

    style A fill:#e74c3c,color:#fff,stroke:#333
    style B fill:#e74c3c,color:#fff,stroke:#333
    style C fill:#27ae60,color:#fff,stroke:#333
    style D fill:#3498db,color:#fff,stroke:#333
    style E fill:#3498db,color:#fff,stroke:#333
```

  1. Challenges overblown claims — Rigorously tests whether LLMs truly surpass elite human programmers (they don’t, on hard problems)
  2. Contamination-free — Continuously updated problems from live contests prevent data leakage
  3. Expert diagnostics — Olympiad medalists provide uniquely qualified analysis of where and why models fail
  4. Fine-grained difficulty — Easy/Medium/Hard separation reveals that LLM strength is implementation, not reasoning
  5. Actionable insights — Line-by-line error analysis helps researchers target specific weaknesses in model reasoning

Conclusion

LiveCodeBench Pro sets a new standard for evaluating AI coding capabilities:

  • Problems from Codeforces, ICPC, and IOI — the most elite competitive programming platforms — continuously updated to prevent contamination
  • Olympiad medalists annotate every problem and conduct line-by-line failure analysis
  • The best models still fail on hard problems — the original paper found 0% pass@1 on hard problems without tools; even with the latest reasoning models, a significant gap remains
  • Implementation precision ≠ algorithmic reasoning — LLMs excel at clean coding but struggle with the creative insight that distinguishes expert programmers
  • Confidently wrong reasoning — models generate plausible-sounding but incorrect justifications, a critical reliability concern

As AI coding capabilities advance rapidly, LiveCodeBench Pro provides the essential ground truth: how far are we really from human grandmaster-level algorithmic reasoning? The answer, for now, is still quite far.

References

  • Zheng, Z., Cheng, Z., Shen, Z., Zhou, S., Liu, K., He, H., Li, D., Wei, S., Hao, H., Yao, J., Sheng, P., Wang, Z., Chai, W., Korolova, A., Henderson, P., Arora, S., Viswanath, P., Shang, J., & Xie, S. “LiveCodeBench Pro: How Do Olympiad Medalists Judge LLMs in Competitive Programming?” arXiv preprint arXiv:2506.11928 (2025). arxiv.org/abs/2506.11928
  • LiveCodeBench Pro. “Project Page.” livecodebenchpro.com
  • Jain, N., Han, K., Gu, A., Li, W., Yan, F., Zhang, T., Wang, S., Solar-Lezama, A., Sen, K., & Stoica, I. “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code.” arXiv preprint arXiv:2403.07974 (2024). arxiv.org/abs/2403.07974
